Week-4
Data visualization of the Honey Production dataset using the seaborn and matplotlib libraries.
The goal is to use Python visualization libraries such as seaborn and matplotlib to explore the data and draw useful conclusions.
# importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import sklearn as sk
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")
# reading the dataset
df=pd.read_csv("honeyproduction (1).csv")
df
| | state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year |
|---|---|---|---|---|---|---|---|---|
| 0 | AL | 16000.0 | 71 | 1136000.0 | 159000.0 | 0.72 | 818000.0 | 1998 |
| 1 | AZ | 55000.0 | 60 | 3300000.0 | 1485000.0 | 0.64 | 2112000.0 | 1998 |
| 2 | AR | 53000.0 | 65 | 3445000.0 | 1688000.0 | 0.59 | 2033000.0 | 1998 |
| 3 | CA | 450000.0 | 83 | 37350000.0 | 12326000.0 | 0.62 | 23157000.0 | 1998 |
| 4 | CO | 27000.0 | 72 | 1944000.0 | 1594000.0 | 0.70 | 1361000.0 | 1998 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 621 | VA | 4000.0 | 41 | 164000.0 | 23000.0 | 3.77 | 618000.0 | 2012 |
| 622 | WA | 62000.0 | 41 | 2542000.0 | 1017000.0 | 2.38 | 6050000.0 | 2012 |
| 623 | WV | 6000.0 | 48 | 288000.0 | 95000.0 | 2.91 | 838000.0 | 2012 |
| 624 | WI | 60000.0 | 69 | 4140000.0 | 1863000.0 | 2.05 | 8487000.0 | 2012 |
| 625 | WY | 50000.0 | 51 | 2550000.0 | 459000.0 | 1.87 | 4769000.0 | 2012 |
626 rows × 8 columns
# checking the shape of this dataset
df.shape
(626, 8)
# checking the size (total number of elements: rows × columns)
df.size
5008
# getting random samples
df.sample(5)
| | state | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year |
|---|---|---|---|---|---|---|---|---|
| 620 | VT | 4000.0 | 60 | 240000.0 | 53000.0 | 2.39 | 574000.0 | 2012 |
| 342 | WY | 40000.0 | 56 | 2240000.0 | 291000.0 | 0.89 | 1994000.0 | 2005 |
| 163 | SD | 235000.0 | 65 | 15275000.0 | 12220000.0 | 0.71 | 10845000.0 | 2001 |
| 118 | PA | 25000.0 | 45 | 1125000.0 | 630000.0 | 0.76 | 855000.0 | 2000 |
| 326 | NM | 7000.0 | 49 | 343000.0 | 113000.0 | 1.03 | 353000.0 | 2005 |
# checking the dtypes
df.dtypes
state           object
numcol         float64
yieldpercol      int64
totalprod      float64
stocks         float64
priceperlb     float64
prodvalue      float64
year             int64
dtype: object
# Examining the information of the Honey production dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 626 entries, 0 to 625
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   state        626 non-null    object
 1   numcol       626 non-null    float64
 2   yieldpercol  626 non-null    int64
 3   totalprod    626 non-null    float64
 4   stocks       626 non-null    float64
 5   priceperlb   626 non-null    float64
 6   prodvalue    626 non-null    float64
 7   year         626 non-null    int64
dtypes: float64(5), int64(2), object(1)
memory usage: 39.2+ KB
# summary statistics
df.describe()
| | numcol | yieldpercol | totalprod | stocks | priceperlb | prodvalue | year |
|---|---|---|---|---|---|---|---|
| count | 626.000000 | 626.000000 | 6.260000e+02 | 6.260000e+02 | 626.000000 | 6.260000e+02 | 626.000000 |
| mean | 60284.345048 | 62.009585 | 4.169086e+06 | 1.318859e+06 | 1.409569 | 4.715741e+06 | 2004.864217 |
| std | 91077.087231 | 19.458754 | 6.883847e+06 | 2.272964e+06 | 0.638599 | 7.976110e+06 | 4.317306 |
| min | 2000.000000 | 19.000000 | 8.400000e+04 | 8.000000e+03 | 0.490000 | 1.620000e+05 | 1998.000000 |
| 25% | 9000.000000 | 48.000000 | 4.750000e+05 | 1.430000e+05 | 0.932500 | 7.592500e+05 | 2001.000000 |
| 50% | 26000.000000 | 60.000000 | 1.533000e+06 | 4.395000e+05 | 1.360000 | 1.841500e+06 | 2005.000000 |
| 75% | 63750.000000 | 74.000000 | 4.175250e+06 | 1.489500e+06 | 1.680000 | 4.703250e+06 | 2009.000000 |
| max | 510000.000000 | 136.000000 | 4.641000e+07 | 1.380000e+07 | 4.150000 | 6.961500e+07 | 2012.000000 |
# checking for null values
df.isnull().sum()
state          0
numcol         0
yieldpercol    0
totalprod      0
stocks         0
priceperlb     0
prodvalue      0
year           0
dtype: int64
# checking for duplicate rows
df.duplicated().sum()
0
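Inference: The dataset has 626 rows and 8 columns covering the years 1998 to 2012, with no missing values and no duplicate rows.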
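# values='totalprod' sums production within each year, so each slice is that year's share of total honey output
# (values='year' would sum the year numbers themselves; the `labels` argument expects a dict, so it is dropped here)
pie_chart = px.pie(df, values='totalprod', names='year', title="Percentage of Honey Distribution over the Years-(Pie Chart)")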
# Show percentages and labels inside
pie_chart.update_traces(textposition='inside', textinfo='percent+label')
pie_chart.show()
Inference:
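The yearly shares that the pie chart displays can also be computed directly, which makes them easier to quote in an inference. The sketch below is only illustrative: it assumes that "honey distribution" means `totalprod` summed per year, and it uses just the ten rows visible in the `df` preview above (five from 1998, five from 2012), so the resulting percentages say nothing about the full dataset.

```python
import pandas as pd

# Illustrative subset only: the first five 1998 rows and the last five 2012 rows
# from the df preview, not the full 626-row dataset.
subset = pd.DataFrame({
    "state": ["AL", "AZ", "AR", "CA", "CO", "VA", "WA", "WV", "WI", "WY"],
    "totalprod": [1136000.0, 3300000.0, 3445000.0, 37350000.0, 1944000.0,
                  164000.0, 2542000.0, 288000.0, 4140000.0, 2550000.0],
    "year": [1998] * 5 + [2012] * 5,
})

# Sum production within each year, then express each year as a percentage of the grand total.
yearly_total = subset.groupby("year")["totalprod"].sum()
yearly_share = 100 * yearly_total / yearly_total.sum()

print(yearly_share.round(1))
```

On the real data, replacing `subset` with `df` should give the same percentages that a pie chart of `totalprod` by `year` shows.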
sns.displot(df, x='priceperlb', kde=True)
plt.title('Distplot of Price per lb')
plt.show()
Inference:
1. The distribution of honey prices per pound is right-skewed, indicating a non-normal distribution with a notable group of higher-priced honey.
2. The peak around $1.60 suggests that this is the most common price per pound. However, the wide spread from $0.80 to $4.00 showcases the diverse range of honey prices in the dataset.
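A right skew read off a histogram can be cross-checked numerically: a positive sample skewness and a mean above the median both point to a longer right tail. The following is a minimal sketch using only the `priceperlb` values from the ten rows shown in the `df` preview (an assumption made to keep the example self-contained); on the real data the same calls would be applied to `df['priceperlb']`.

```python
import pandas as pd

# Illustrative subset: priceperlb from the ten rows shown in the df preview
# (five from 1998, five from 2012), not the full dataset.
prices = pd.Series(
    [0.72, 0.64, 0.59, 0.62, 0.70, 3.77, 2.38, 2.91, 2.05, 1.87],
    name="priceperlb",
)

skewness = prices.skew()        # > 0 indicates a longer right tail
mean_price = prices.mean()
median_price = prices.median()

print(f"skewness={skewness:.2f}, mean={mean_price:.3f}, median={median_price:.3f}")
```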
px.scatter(df, x='numcol', y='prodvalue', title="Scatter Plot of numcol vs. prodvalue", trendline="ols")
Inference:
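The `trendline="ols"` option draws an ordinary least squares line through the points (plotly relies on the statsmodels package for this, so it needs to be installed). As a rough sketch of what that line represents, the slope, intercept, and Pearson correlation can be computed with NumPy. The five rows below are the 1998 rows from the `df` preview and serve only as an illustration; the numbers will differ on the full dataset.

```python
import numpy as np

# Illustrative subset: numcol and prodvalue from the five 1998 rows in the df preview.
numcol = np.array([16000.0, 55000.0, 53000.0, 450000.0, 27000.0])
prodvalue = np.array([818000.0, 2112000.0, 2033000.0, 23157000.0, 1361000.0])

# Degree-1 least-squares fit: prodvalue ≈ slope * numcol + intercept
slope, intercept = np.polyfit(numcol, prodvalue, 1)

# Pearson correlation between the two columns
r = np.corrcoef(numcol, prodvalue)[0, 1]

print(f"slope={slope:.2f}, intercept={intercept:.0f}, r={r:.3f}")
```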
px.box(df, x='year', y='prodvalue',title="Boxplot of prodvalue vs year")
Inference:
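# pairplot is figure-level and creates its own figure, so size it with `height` (inches per facet)
# instead of calling plt.figure() afterwards, which only opens a second, empty figure
sns.pairplot(df[['numcol', 'yieldpercol', 'totalprod', 'prodvalue', 'year']], hue="year", diag_kind="kde", corner=True, height=2.5)
plt.show()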
Inference:
1. numcol vs. yieldpercol: A weak positive correlation suggests that, in general, a higher numcol corresponds to a higher yieldpercol, but with considerable scatter.
2. numcol vs. totalprod: A strong positive correlation indicates that a higher numcol leads to a higher totalprod, as expected.
3. numcol vs. prodvalue: A moderate positive correlation implies that a higher numcol tends to result in a higher prodvalue, but other factors play a role.
4. yieldpercol vs. totalprod: A strong positive correlation implies a direct relationship, as totalprod is the product of numcol and yieldpercol.
5. yieldpercol vs. prodvalue: A moderate positive correlation suggests that a higher yieldpercol tends to lead to a higher prodvalue, but other factors contribute.
6. totalprod vs. prodvalue: A strong positive correlation indicates that a higher totalprod corresponds to a higher prodvalue, as prodvalue is calculated from totalprod.
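The two derived-column relationships mentioned in the inferences can be checked on the rows shown in the `df` preview. This sketch assumes that `prodvalue` is `totalprod * priceperlb` rounded to the nearest thousand; that rounding is inferred from the displayed rows, not from any dataset documentation, so the tolerance below is an assumption.

```python
import numpy as np
import pandas as pd

# Rows copied from the df preview above (1998 and 2012).
sample = pd.DataFrame({
    "state": ["AL", "AZ", "AR", "CA", "CO", "VA", "WA"],
    "numcol": [16000.0, 55000.0, 53000.0, 450000.0, 27000.0, 4000.0, 62000.0],
    "yieldpercol": [71, 60, 65, 83, 72, 41, 41],
    "totalprod": [1136000.0, 3300000.0, 3445000.0, 37350000.0, 1944000.0, 164000.0, 2542000.0],
    "priceperlb": [0.72, 0.64, 0.59, 0.62, 0.70, 3.77, 2.38],
    "prodvalue": [818000.0, 2112000.0, 2033000.0, 23157000.0, 1361000.0, 618000.0, 6050000.0],
})

# totalprod should equal numcol * yieldpercol exactly.
prod_matches = np.isclose(sample["numcol"] * sample["yieldpercol"], sample["totalprod"])

# prodvalue should equal totalprod * priceperlb up to rounding
# (assumed here to be to the nearest thousand, hence atol=500).
value_matches = np.isclose(
    sample["totalprod"] * sample["priceperlb"], sample["prodvalue"], rtol=0, atol=500
)

print("totalprod check passes:", bool(prod_matches.all()))
print("prodvalue check passes:", bool(value_matches.all()))
```

Because these columns are computed from one another, strong correlations among them are partly built in rather than an independent finding.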
# size the grid through `height`; a plt.figure() call after pairplot would only create an empty figure
sns.pairplot(df[['numcol', 'yieldpercol', 'totalprod', 'stocks', 'priceperlb', 'prodvalue']], diag_kind="kde", kind="reg", corner=True, height=1.7)
plt.show()
columns = ['numcol', 'yieldpercol', 'totalprod', 'stocks', 'priceperlb', 'prodvalue']
# Calculate the correlation matrix
correlation_matrix = df[columns].corr()
# Create a heatmap
plt.figure(figsize=(5, 5))
sns.heatmap(correlation_matrix,annot=True)
plt.title('Correlation Plot of Selected Columns')
plt.show()
Inferences:
Strong positive correlations:
Strong negative correlations:
Weak correlations:
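The three categories above can be filled in from the correlation matrix rather than read off the heatmap by eye. The helper below is a hypothetical sketch: `classify_correlations` is not part of pandas or seaborn, and the 0.7 cut-off for "strong" is an assumed convention, not something fixed by the notebook. It is demonstrated on toy data so that the grouping is easy to verify.

```python
import pandas as pd

def classify_correlations(corr, strong=0.7):
    """Group each unique off-diagonal pair of a correlation matrix by strength.

    Hypothetical helper; the `strong` threshold of 0.7 is an assumption.
    """
    groups = {"strong_positive": [], "strong_negative": [], "weak": []}
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:          # visit each pair once, skipping the diagonal
            r = float(corr.loc[a, b])
            if r >= strong:
                key = "strong_positive"
            elif r <= -strong:
                key = "strong_negative"
            else:
                key = "weak"
            groups[key].append((a, b, round(r, 2)))
    return groups

# Toy data: double_x rises with x, neg_x falls with x, zigzag is uncorrelated with x.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "double_x": [2, 4, 6, 8, 10],
    "neg_x": [5, 4, 3, 2, 1],
    "zigzag": [1, -1, 0, -1, 1],
})

groups = classify_correlations(toy.corr())
for name, pairs in groups.items():
    print(name, pairs)
```

With the notebook's data, the call would be `classify_correlations(correlation_matrix)`, using the matrix computed above.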